
Support cloud model paths (s3://, gs://) for fresh training runs #1113

Draft
bthecohen wants to merge 2 commits into NovaSky-AI:main from DataDog:ben.cohen/cloud-policy-model-test

Conversation


bthecohen commented Feb 13, 2026

Summary

Adds support for using cloud storage paths (S3, GCS) as the initial model path when starting fresh training runs. Previously, cloud paths were only supported when resuming from checkpoints.

This enables workflows like:

  • Starting RL training from an SFT checkpoint stored in cloud storage
  • Staged curriculum learning where models are trained on one task, uploaded to cloud, then used as the starting point for the next stage
  • Using a shared model registry in S3/GCS without manually downloading models to each node first

Approach

Reuses the existing io.local_read_dir() context manager at each model loading site. For local paths, it's a no-op passthrough. For cloud paths, it downloads to a temp directory and yields the local path. Temp files are cleaned up after model weights are loaded into GPU memory.
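The pattern can be sketched as follows. This is a minimal illustration, not the repo's actual io.py; `is_cloud_path` and `download_dir` stand in for the real s3fs/gcsfs-backed helpers:

```python
import contextlib
import shutil
import tempfile

def is_cloud_path(path: str) -> bool:
    # s3:// and gs:// URIs are cloud paths; everything else passes through.
    return path.startswith(("s3://", "gs://"))

def download_dir(cloud_path: str, dest: str) -> None:
    # Placeholder: the real code fetches recursively via s3fs/gcsfs.
    raise NotImplementedError(cloud_path)

@contextlib.contextmanager
def local_read_dir(input_path: str):
    # Local paths: no-op passthrough.
    if not is_cloud_path(input_path):
        yield input_path
        return
    # Cloud paths: download to a temp dir, yield it, clean up on exit.
    tmp = tempfile.mkdtemp(prefix="cloud_model_")
    try:
        download_dir(input_path, tmp)
        yield tmp
    finally:
        shutil.rmtree(tmp, ignore_errors=True)
```

Because cleanup runs when the context exits, each loading site holds the temp directory only for as long as it takes to move the weights onto the GPU.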

Changes

  • io.py: Fix local_read_dir to handle s3fs nesting files under a subdirectory when downloading recursively
  • main_base.py: Wrap get_tokenizer() with local_read_dir
  • fsdp_worker.py: Wrap init_model() in Policy, Critic, and Ref workers with local_read_dir
  • megatron_worker.py: Wrap init_model() in Policy and Ref workers with local_read_dir (covers both init_configs and make_megatron_module, each of which needs model files)
  • vllm_engine.py: Resolve cloud path in BaseVLLMInferenceEngine.__init__(), preserving original path as served_model_name
  • sglang_engine.py: Resolve cloud path in SGLangInferenceEngine.__init__() (same pattern as vLLM)

Design decisions

  • Config is not mutated — the cloud path stays in the config for model naming and equality comparisons
  • Each node downloads independently — no shared filesystem assumption
  • served_model_name preserved — inference engines use the original cloud path for API naming consistency
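With this in place, a fresh run can point directly at a cloud path via the config override. The invocation below is illustrative (entrypoint and strategy flag are assumptions); only `trainer.policy.model.path` comes from this PR:

```shell
python -m skyrl_train.entrypoints.main_base \
  trainer.policy.model.path=s3://my-bucket/checkpoints/qwen2.5-1.5b-sft \
  trainer.strategy=fsdp2
```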

Testing

Tested on a 2-node (16x A100) Ray cluster with an S3-hosted Qwen2.5-1.5B-Instruct model on GSM8K:

| # | Strategy | Colocate | Nodes | Result |
|---|----------|----------|-------|--------|
| 1 | fsdp2 | true | 2×8 | PASS |
| 2 | fsdp2 | false | 1×8 + 8 engines | PASS |
| 3 | megatron | true | 2×8 | PASS |
| 4 | fsdp2 | true | 1×8 | PASS |

Not yet tested / follow-up needed

  • SGLang: Our Ray image doesn't include SGLang. The implementation follows the same pattern as vLLM (cloud path resolution inside the engine's __init__). Maintainers, please verify.
  • VLLMServerActor (HTTP inference path): The get_http_inference_client() entrypoint was added after this branch's base commit, so cloud path support for VLLMServerActor should be added as a follow-up once the HTTP inference codepath is the default.

🤖 Generated with Claude Code



…ning runs

Wrap model loading sites with io.local_read_dir() so users can set
trainer.policy.model.path to a cloud path. For local paths this is a
no-op; for cloud paths it downloads to a temp dir and cleans up after
the model is loaded into GPU memory.

Sites wrapped: get_tokenizer, FSDP init_model (policy/critic/ref),
Megatron init_configs, vLLM engine init, SGLang engine creation,
and VLLMServerActor init.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


gemini-code-assist bot left a comment


Code Review

This pull request adds support for using cloud storage paths for initial model loading, a valuable enhancement for flexible training workflows. However, combined with the hardcoded trust_remote_code=True setting, the feature introduces a critical Remote Code Execution (RCE) vulnerability if an attacker can control the model path configuration. It is strongly recommended to make trust_remote_code a configurable parameter that defaults to False, requiring explicit opt-in before running potentially untrusted code. In addition, sglang_engine.py and vllm_engine.py need fixes to prevent resource leaks caused by manual context manager handling, and clarification is requested on logic removed from megatron_worker.py to ensure there are no unintended side effects.


skyrl-train/skyrl_train/entrypoints/main_base.py (133)

security-critical critical

The application hardcodes trust_remote_code=True when initializing the tokenizer. Combined with the newly added support for cloud storage paths (S3, GCS), this allows for Remote Code Execution (RCE) if an attacker can influence the model path configuration. An attacker could point the application to a malicious remote model containing executable code in its configuration files, which would be downloaded and executed on the training worker. It is recommended to make trust_remote_code a configurable parameter that defaults to False.
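A minimal sketch of the suggested fix: plumb the flag through config instead of hardcoding it. The names below (`ModelConfig`, `tokenizer_kwargs`) are hypothetical, not the project's actual config classes:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    path: str
    # Default-deny: custom code shipped with a downloaded model
    # only runs when the user explicitly opts in.
    trust_remote_code: bool = False

def tokenizer_kwargs(cfg: ModelConfig, disable_fast_tokenizer: bool = False) -> dict:
    # kwargs to forward to AutoTokenizer.from_pretrained at the call site.
    return {
        "pretrained_model_name_or_path": cfg.path,
        "trust_remote_code": cfg.trust_remote_code,
        "use_fast": not disable_fast_tokenizer,
    }
```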

skyrl-train/skyrl_train/workers/fsdp/fsdp_worker.py (119)

security-critical critical

The application hardcodes trust_remote_code=True when loading model configurations. This is particularly dangerous as this PR enables loading models from arbitrary cloud storage paths (S3, GCS). If an attacker can control the model_path, they can trigger the execution of arbitrary code embedded in the remote model's configuration. This setting should be configurable and default to False to prevent unauthorized code execution.

skyrl-train/skyrl_train/workers/fsdp/fsdp_worker.py (282)

security-critical critical

Hardcoded trust_remote_code=True allows for arbitrary code execution when loading models from the provided path. With the addition of cloud path support, this risk is extended to remote sources. This should be made configurable and default to False.

skyrl-train/skyrl_train/workers/fsdp/fsdp_worker.py (357)

security-critical critical

Hardcoded trust_remote_code=True allows for arbitrary code execution when loading models from the provided path. With the addition of cloud path support, this risk is extended to remote sources. This should be made configurable and default to False.

skyrl-train/skyrl_train/workers/megatron/megatron_worker.py (431)

security-critical critical

The init_model method now supports loading models from cloud storage paths via io.local_read_dir. However, the underlying init_configs method hardcodes trust_remote_code=True when loading the tokenizer and config. This allows a remote attacker who can influence the model_path to achieve Remote Code Execution (RCE) by providing a malicious model. You should avoid hardcoding trust_remote_code=True and instead make it a user-controlled option that defaults to False.

skyrl-train/skyrl_train/workers/megatron/megatron_worker.py (719)

security-critical critical

The init_model method now supports loading models from cloud storage paths via io.local_read_dir. However, the underlying init_configs method hardcodes trust_remote_code=True when loading the tokenizer and config. This allows a remote attacker who can influence the model_path to achieve Remote Code Execution (RCE) by providing a malicious model. You should avoid hardcoding trust_remote_code=True and instead make it a user-controlled option that defaults to False.

skyrl-train/skyrl_train/inference_engines/sglang/sglang_engine.py (254-258)

high

This manual handling of the context manager's __enter__ and __exit__ methods is not robust. If an exception occurs during the initialization of the Engine (line 283), the __exit__ method on line 288 will not be called, leading to a resource leak (the temporary directory will not be cleaned up).

To ensure cleanup happens reliably, you should wrap the code between __enter__ and __exit__ in a try...finally block.

Example:

```python
self._cloud_model_ctx = None
if io.is_cloud_path(original_model_path):
    self._cloud_model_ctx = io.local_read_dir(original_model_path)
    kwargs["model_path"] = self._cloud_model_ctx.__enter__()

try:
    # ... existing initialization code ...
    self.engine = Engine(**kwargs)
    # ...
finally:
    if self._cloud_model_ctx is not None:
        self._cloud_model_ctx.__exit__(None, None, None)
        self._cloud_model_ctx = None
```

Since this change would affect lines not in the diff, I'm leaving it as a comment, but this is an important fix for resource management.

skyrl-train/skyrl_train/inference_engines/vllm/vllm_engine.py (81-87)

high

Similar to the sglang_engine.py, the manual use of __enter__ and __exit__ can lead to resource leaks. If self._create_engine() on line 102 raises an exception, the temporary directory for the cloud model will not be cleaned up.

Please wrap the engine initialization logic in a try...finally block to ensure self._cloud_model_ctx.__exit__() is always called after __enter__().

Example:

```python
self._cloud_model_ctx = None
if io.is_cloud_path(original_model_path):
    self._cloud_model_ctx = io.local_read_dir(original_model_path)
    kwargs["model"] = self._cloud_model_ctx.__enter__()
    # ...

try:
    # ... existing initialization code ...
    self.llm = self._create_engine(*args, **kwargs)
finally:
    if self._cloud_model_ctx is not None:
        self._cloud_model_ctx.__exit__(None, None, None)
        self._cloud_model_ctx = None
```

skyrl-train/skyrl_train/workers/megatron/megatron_worker.py (202-208)

medium

This block of code, which overrides token IDs in the model configuration based on the tokenizer, has been removed. This change seems unrelated to supporting cloud model paths and is not mentioned in the pull request description.

Could you please clarify the reason for this removal? If this logic was necessary to ensure consistency between the tokenizer and the model config, removing it might introduce subtle bugs. If it's now redundant, a brief explanation would be helpful for future maintainers.

…mp dir cleanup

- Restore accidentally removed update_model_config block in MegatronWorker.init_configs
  that propagates token IDs from tokenizer and applies user model_config_kwargs
- Wrap vLLM and SGLang engine init in try/finally to ensure cloud model temp
  directories are cleaned up even if engine creation fails

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bthecohen bthecohen marked this pull request as draft February 18, 2026 03:28

devin-ai-integration bot left a comment


Devin Review found 2 new potential issues.


Comment on lines +130 to +135
```python
with io.local_read_dir(self.cfg.trainer.policy.model.path) as model_path:
    tokenizer = AutoTokenizer.from_pretrained(
        model_path,
        trust_remote_code=True,
        use_fast=not self.cfg.trainer.disable_fast_tokenizer,
    )
```

🔴 local_read_dir raises FileNotFoundError for HuggingFace model names, breaking non-local/non-cloud paths

Wrapping model loading in local_read_dir breaks HuggingFace Hub model name support (e.g., Qwen/Qwen2.5-1.5B-Instruct). AutoTokenizer.from_pretrained natively resolves HF model names, but now local_read_dir raises FileNotFoundError before the HF method is ever called.

Root Cause and Impact across all affected call sites

local_read_dir at skyrl-train/skyrl_train/utils/io/io.py:196-198 checks exists(input_path) for non-cloud paths, which uses the local filesystem. A HuggingFace model ID like "Qwen/Qwen2.5-1.5B-Instruct" is neither a cloud path nor a local path, so is_cloud_path() returns False, then exists() returns False, and FileNotFoundError is raised.

Before this PR, AutoTokenizer.from_pretrained(self.cfg.trainer.policy.model.path) handled HF model names natively. Now the local_read_dir wrapper intercepts and fails first.

This same regression affects every new call site added by this PR:

  • skyrl-train/skyrl_train/workers/fsdp/fsdp_worker.py:103 (Policy), :269 (Critic), :346 (Ref) — AutoConfig.from_pretrained and HFModelWrapper both accept HF names natively
  • skyrl-train/skyrl_train/workers/megatron/megatron_worker.py:439 (Policy), :727 (Ref) — especially severe because the existing snapshot_download fallback (which explicitly handles HF model names) is now unreachable: local_read_dir raises before execution reaches it

Impact: Any user specifying a HuggingFace model ID in trainer.policy.model.path (a previously supported workflow) will get a crash at startup.

Prompt for agents
The fix needs to be applied at the local_read_dir function in skyrl-train/skyrl_train/utils/io/io.py (lines 195-198), not at each call site. The else branch currently raises FileNotFoundError for any path that doesn't exist on the local filesystem, but HuggingFace model names (e.g., "Qwen/Qwen2.5-1.5B-Instruct") are valid non-local, non-cloud paths that should be passed through unchanged. The simplest fix is to remove the existence check in the else branch of local_read_dir and just yield input_path directly for non-cloud paths. The downstream libraries (AutoTokenizer, AutoConfig, AutoModelForCausalLM, AutoBridge, etc.) all handle both local paths and HuggingFace model names already. Alternatively, if you want to keep validation for local paths, add a check that distinguishes local filesystem paths from HF model names (e.g., check if the path contains os.sep or starts with / or .).


Comment on lines +439 to +463
```python
with io.local_read_dir(model_path) as model_path:
    # initialize the bridge and provider objects
    self.init_configs(
        model_path,
        self.cfg.trainer.policy.megatron_config,
        self.cfg.trainer.policy.megatron_config.model_config_kwargs,
        self.cfg.trainer.policy.megatron_config.transformer_config_kwargs,
        bf16=self.cfg.trainer.bf16,
        flash_attn=self.cfg.trainer.flash_attn,
    )

    # wrap with DDP for training
    self.actor_module = self.make_megatron_module(
        wrap_with_ddp=True,
        ddp_config=self.cfg.trainer.policy.megatron_config.ddp_config,
        lora_config=self.cfg.trainer.policy.model.lora if self._is_lora else None,
        lora_type=self.cfg.trainer.policy.megatron_config.lora_config.lora_type,
        bf16=self.cfg.trainer.bf16,
    )

    if self._local_rank == 0 and not os.path.exists(
        model_path
    ):  # if not local path, try downloading model weights from huggingface
        snapshot_download(model_path)  # will be no-op if already downloaded
    torch.distributed.barrier()
```

🔴 Megatron snapshot_download fallback unreachable inside local_read_dir for HuggingFace model names

The Megatron policy worker's snapshot_download fallback at lines 459–462 is now inside the local_read_dir context. When model_path is a HuggingFace model name (not a cloud path and not a local path), local_read_dir raises FileNotFoundError at skyrl-train/skyrl_train/utils/io/io.py:197-198 before execution ever reaches the snapshot_download call.

Detailed Explanation

Before this PR, the code handled HF model names gracefully:

```python
self.init_configs(model_path, ...)  # AutoConfig/AutoBridge handle HF names
self.actor_module = self.make_megatron_module(...)
if self._local_rank == 0 and not os.path.exists(model_path):
    snapshot_download(model_path)  # downloads HF model for weight loading
torch.distributed.barrier()
```

After this PR, wrapping in local_read_dir means FileNotFoundError is raised before init_configs is even called. The snapshot_download code at lines 459-462 is dead code for the HF model name case.

The same issue affects MegatronRefWorkerBase.init_model at skyrl-train/skyrl_train/workers/megatron/megatron_worker.py:727 where snapshot_download at lines 745-748 is similarly unreachable.

Impact: Megatron workers can no longer be initialized with HuggingFace model names — a previously working workflow.


